LLM training and tuning
See:
Resources
- Maxime Labonne - A Beginner’s Guide to LLM Fine-Tuning
- Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem | PyTorch
- Fine-Tuning - LlamaIndex
- Maxime Labonne - Fine-tune Mistral-7b with Direct Preference Optimization
- Maxime Labonne - Fine-tune Llama 3 with ORPO
- Maxime Labonne - Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth
- Efficient Fine-tuning with PEFT and LoRA | Niklas Heidloff
- From Zero to PPO: Understanding the Path to Helpful AI Models
- Pre-training
- Goal: Train a large language model (LLM) on vast amounts of text data to predict the next token.
- Objective: Minimize cross-entropy loss.
- Outcome: A model with broad general knowledge but limited alignment to human intent (e.g., helpfulness, honesty).
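A minimal sketch of the pre-training objective above: next-token prediction with cross-entropy loss. The model name (`gpt2`) is only a stand-in for any Hugging Face causal LM; passing `labels=` to the model would compute the same shifted loss internally.

```python
# Next-token prediction with cross-entropy, computed explicitly for clarity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# logits: (batch, seq_len, vocab_size)
logits = model(**inputs).logits

# Shift so that position t predicts token t+1, then average cross-entropy.
shift_logits = logits[:, :-1, :]
shift_labels = inputs["input_ids"][:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(f"next-token cross-entropy: {loss.item():.3f}")
```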
- Supervised Fine-Tuning (SFT)
- Purpose: Align the model with human-preferred conversational behaviors.
- Process:
- Fine-tune the pre-trained model on a curated dataset of high-quality, human-written prompt-response pairs.
- Limitation: Constrained by dataset size and quality.
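A minimal SFT sketch on a single toy prompt-response pair, assuming a small causal LM (`gpt2` as a stand-in) and a simplified prompt/response boundary. Loss is computed only on response tokens, which is the usual way to fine-tune on curated instruction data.

```python
# Supervised fine-tuning step: mask prompt tokens with -100 so only the
# response contributes to the loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "### Instruction:\nExplain fine-tuning in one sentence.\n### Response:\n"
response = "Fine-tuning adapts a pre-trained model to a task using labeled examples."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response + tokenizer.eos_token, return_tensors="pt").input_ids

# Mask the prompt portion (boundary handling simplified for the sketch).
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss: {loss.item():.3f}")
```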
- Rejection Sampling
- Purpose: Enhance response quality using human feedback.
- Process:
- Generate multiple responses per prompt.
- Human annotators rank or select the best response(s).
- Use the selected responses to further fine-tune the model.
- Limitation: Time- and resource-intensive.
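A rejection-sampling sketch: sample several candidates for the same prompt and keep the best one for another fine-tuning round. The `score` function here is a placeholder for the human ranking step (or a reward model, see the next section); `gpt2` and the sampling parameters are illustrative.

```python
# Sample N candidate completions, keep the highest-scoring one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain why the sky is blue:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    max_new_tokens=40,
    num_return_sequences=4,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [
    tokenizer.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    for o in outputs
]

def score(response: str) -> float:
    # Placeholder: in practice a human ranking or a trained reward model.
    return float(len(response.split()))

best = max(candidates, key=score)
print("Selected response:", best)
# `best` would then be added to the fine-tuning dataset for another SFT round.
```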
- Reward Modeling
- Objective: Automate response evaluation to scale human feedback.
- Process:
- Train a reward model to predict human preferences.
- Input: Pairs of responses with human rankings.
- Output: Scalar reward values representing response quality.
- Use the reward model to score new responses.
- Advantage: Scalable and reduces dependency on manual evaluation.
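A reward-model sketch: a sequence classifier with a single scalar head trained on (chosen, rejected) pairs with the pairwise ranking loss -log σ(r_chosen − r_rejected). The base model (`distilbert-base-uncased`) and the example pair are stand-ins.

```python
# Pairwise (Bradley-Terry style) reward-model training step.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # one scalar reward per sequence
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

prompt = "How do I learn Python?"
chosen = prompt + " Start with the official tutorial and practice small projects daily."
rejected = prompt + " Just figure it out."

chosen_ids = tokenizer(chosen, return_tensors="pt", truncation=True)
rejected_ids = tokenizer(rejected, return_tensors="pt", truncation=True)

r_chosen = reward_model(**chosen_ids).logits.squeeze(-1)
r_rejected = reward_model(**rejected_ids).logits.squeeze(-1)

# Push the reward of the human-preferred response above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"pairwise loss: {loss.item():.3f}")
```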
- Reinforcement Learning with Human Feedback (RLHF)
- Objective: Optimize responses using reinforcement learning based on the reward model.
- Key Techniques:
- Proximal Policy Optimization (PPO): Stable and efficient RL algorithm.
- Training Loop:
- Generate a response for a prompt.
- Evaluate the response using the reward model.
- Update model parameters with PPO, balancing:
- Exploration: Discovering better responses.
- Exploitation: Refining known high-reward responses.
- KL Regularization: Penalizes excessive divergence from the pre-trained policy to retain general knowledge.
- Outcome: A model aligned with user intent and capable of producing helpful, safe, and relevant responses.
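A heavily simplified sketch of the RLHF loop above: generate a response from the policy, score it (the reward-model call is stubbed here), subtract a KL penalty against the frozen reference model, and take a plain policy-gradient step. Real systems use PPO's clipped surrogate objective (e.g. via TRL); this only illustrates the shape of the training loop, and `beta`, `gpt2`, and the stubbed reward are assumptions.

```python
# Generate -> score with reward model -> KL-penalized reward -> policy update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")             # trainable policy
ref_policy = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen reference
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # KL penalty coefficient (illustrative value)

prompt_ids = tokenizer("Write a friendly greeting:", return_tensors="pt").input_ids

# 1) Generate a response from the current policy.
gen = policy.generate(prompt_ids, do_sample=True, max_new_tokens=20,
                      pad_token_id=tokenizer.eos_token_id)

def token_logprobs(model, full_ids, n_prompt):
    """Per-token log-probs of the response tokens under `model`."""
    logits = model(full_ids).logits[:, :-1, :]
    logp = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    lp = torch.gather(logp, 2, targets.unsqueeze(-1)).squeeze(-1)
    return lp[:, n_prompt - 1:]  # keep only response positions

logp_policy = token_logprobs(policy, gen, prompt_ids.shape[1])
with torch.no_grad():
    logp_ref = token_logprobs(ref_policy, gen, prompt_ids.shape[1])

# 2) Reward: reward-model score (stubbed) minus KL divergence from the reference.
reward_model_score = torch.tensor(1.0)  # placeholder for a trained reward model
kl = (logp_policy - logp_ref).sum()
reward = reward_model_score - beta * kl.detach()

# 3) Policy-gradient step (REINFORCE-style stand-in for PPO's clipped update).
loss = -(reward * logp_policy.sum())
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"reward: {reward.item():.3f}, kl: {kl.item():.3f}")
```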
Comparison of LLM tuning strategies
- Full Fine-Tuning, PEFT, Prompt Engineering, or RAG? (deci.ai)
- Domain specific generative AI: pre-training, fine-tuning & RAG — Elastic Search Labs
RLHF
- Illustrating Reinforcement Learning from Human Feedback (RLHF) (huggingface.co)
- What is RLHF? | IBM
- Reinforcement Learning from Human Feedback (RLHF) | Niklas Heidloff
- RLHF - Hugging Face Deep RL Course
Code
- #CODE Axolotl
- Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures
- #CODE Unsloth
- #CODE Torchtune